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Abstract 

A constrained £i minimization method is proposed for estimating a sparse 
inverse covariance matrix based on a sample of n iid p-variate random vari- 
ables. The resulting estimator is shown to enjoy a number of desirable prop- 
erties. In particular, it is shown that the rate of convergence between the 
estimator and the true s-sparse precision matrix under the spectral norm is 



s^J\ogp/n when the population distribution has either exponential- type tails 
or polynomial-type tails. Convergence rates under the elementwise ^oo norm 
and Frobenius norm are also presented. In addition, graphical model selection 
is considered. The procedure is easily implementable by linear programming. 
Numerical performance of the estimator is investigated using both simulated 
and real data. In particular, the procedure is applied to analyze a breast 
cancer dataset. The procedure performs favorably in comparison to existing 
methods. 
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1 Introduction 



Estimation of covariance matrix and its inverse is an important problem in many 
areas of statistical analysis. Among many interesting examples are principal compo- 
nent analysis, linear/quadratic discriminant analysis, and graphical models. Stable 
and accurate covariance estimation is becoming increasingly more important in the 
high dimensional setting where the dimension p can be much larger than the sample 
size n. In this setting classical methods and results based on fixed p and large n are 
no longer applicable. An additional challenge in the high dimensional setting is the 
computational costs. It is important that estimation procedures are computationally 
effective so that they can be used in high dimensional applications. 

Let X = (Xi, . . . , Xp) be a p-variate random vector with covariance matrix Sq 
and precision matrix Qq := Sq ^. Given an independent and identically distributed 
random sample {Xi, . . . , X„} from the distribution of X, the most natural estimator 
of So is perhaps 



where X = Y12=i-^k- However, S„ is singular if p > n, and thus is unstable 
for estimating Sq, not to mention that one cannot use its inverse to estimate the 
precision matrix fio- Iii order to estimate the covariance matrix Sq consistently, 
special structures are usually imposed and various estimators have been introduced 
under these assumptions. When the variables exhibit a certain ordering structure, 
which is often the case for time series data, Bickel and Levina (2008a) proved that 
banding the sample covariance matrix leads to a consistent estimator. Cai, Zhang 
and Zhou (2010) established the minimax rate of convergence and introduced a 
rate-optimal tapering estimator. El Karoui (2008) and Bickel and Levina (2008b) 




k=l 
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proposed thresholding of the sample covariance matrix for estimating a class of 
sparse covariance matrices and obtained rates of convergence for the thresholding 
estimators. 

Estimation of the precision matrix Qq is more involved due to the lack of a natu- 
ral pivotal estimator like S„. Assuming certain ordering structures, methods based 
on banding the Cholesky factor of the inverse have been proposed and studied. See, 
e.g., Wu and Pourahmadi (2003), Huang et al. (2006), Bickel and Levina (2008b). 
Penalized likelihood methods have also been introduced for estimating sparse pre- 
cision matrices. In particular, the ii penalized normal likelihood estimator and its 
variants, which shall be called £i-MLE type estimators, were considered in several 
papers; see, for example. Yuan and Lin (2007), Friedman et al. (2008), d'Aspremont 
et al. (2008), and Rothman et al. (2008). Convergence rate under the Frobenius 
norm loss was given in Rothman et al. (2008). Yuan (2009) derived the rates of con- 
vergence for subgaussian distributions. Under more restrictive conditions such as 
mutual incoherence or irrepresentable conditions, Ravikumar et al. (2008) obtained 
the rates of convergence in the elementwise i^o norm and spectral norm. Nonconvex 
penalties, usually computationally more demanding, have also been considered un- 
der the same normal likelihood model. For example. Lam and Fan (2009) and Fan 
et al. (2009) considered penalizing the normal likelihood with the nonconvex SCAD 
penalty. The main goal is to ameliorate the bias problem due to ii penalization. 

A closely related problem is the recovery of the support of the precision matrix, 
which is strongly connected to the selection of graphical models. To be more specific, 
let G = {V, E) be a graph representing conditional independence relations between 
components of X. The vertex set V has p components Xi, . . . , Xp and the edge set 
E consists of ordered pairs where E E if there is an edge between Xi 

and Xj. The edge between X^ and Xj is excluded from E if and only if Xj and 
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Xj are independent given (Xk,k 7^ If X ~ A^(/Xq,Eo), then the conditional 

independence between Xi and Xj given other variables is equivalent to = 0, 
where we set Qq = Hence, for Gaussian distributions, recovering the structure 

of the graph G is equivalent to the estimation of the support of the precision matrix 
(Lauritzen (1996)). A recent paper by Liu et al. (2009) showed that for a class of non- 
Gaussian distribution called nonparanormal distribution, the problem of estimating 
the graph can also be reduced to the estimation of the precision matrix. In an 
important paper, Meinshausen and Biihlmann (2006) demonstrated convincingly 
a neighborhood selection approach to recover the support of fig in a row by row 
fashion. Yuan (2009) replaced the lasso selection by a Dantzig type modification, 
where first the ratios between the off-diagonal elements Uij and the corresponding 
diagonal element uu were estimated for each row i and then the diagonal entries Uu 
were obtained given the estimated ratios. Convergence rates under the matrix ii 
norm and spectral norm losses were established. 

In the present paper, we study estimation of the precision matrix VIq for both 
sparse and non-sparse matrices, without restricting to a specific sparsity pattern. In 
addition, graphical model selection is also considered. A new method of constrained 
£i-minimization for inverse matrix estimation (CLIME) is introduced. Rates of 
convergence in spectral norm as well as elementwise £00 norm and Frobenius norm 
are established under weaker assumptions, and are shown to be faster than those 
given for the £i-MLE estimators when the population distribution has polynomial- 
type tails. A matrix is called s-sparse if there are at most s non-zero elements 
on each row. It is shown that when fio is s-sparse and X has either exponential- 
type or polynomial-type tails, the error between our estimator Cl and flo satisfies 
||r2 — fiolh = Op{s^y\ogp/n) and {Cl — r2o|oo = Op{>/]ogp/n) , where || ■ II2 and 
I ■ loo are the spectral norm and elementwise /qo norm respectively. Properties of the 
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CLIME estimator for estimating banded precision matrices are also discussed. The 
CLIME method can also be adopted for the selection of graphical models, with an 
additional thresholding step. The elementwise i^o norm result is instrumental for 
graphical model selection. 

In addition to its desirable theoretical properties, the CLIME estimator is com- 
putationally very attractive for high dimensional data. It can be obtained one 
column at a time by solving a linear program, and the resulting matrix estimator 
is formed by combining the vector solutions (after a simple symmetrization) . No 
outer iterations are needed and the algorithm is easily scalable. An R package of 
our method has been developed and is publicly available on the web. Numerical 
performance of the estimator is investigated using both simulated and real data. 
In particular, the procedure is applied to analyze a breast cancer dataset. Results 
show that the procedure performs favorably in comparison to existing methods. 

The rest of the paper is organized as follows. In Section [21 after basic nota- 
tions and definitions are introduced, we present the CLIME estimator. Theoretical 
properties including the rates of convergence are established in Section |3l Graphical 
model selection is discussed in Section HI Numerical performance of the CLIME 
estimator is considered in Section |5] through simulation studies and a real data 
analysis. Further discussions on the connections and differences of our results with 
other related work are given in Section [61 The proofs of the main results are given 
in Section [3 

2 Estimation via Constrained li Minimization 

In compressed sensing and high dimensional linear regression literature, it is now 
well understood that constrained £i minimization provides an effective way for re- 
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constructing a sparse signal. See, for example, Donoho et al. (2006) and Candes 
and Tao (2007). A particularly simple and elementary analysis of constrained £i 
minimization methods is given in Cai, Wang and Xu (2010). 

In this section, we introduce a method of constrained ii minimization for in- 
verse covariance matrix estimation. We begin with basic notations and definitions. 
Throughout, for a vector a = (ai, . . . , a^)'^ G -K^, define |a|i = Yl^=i l^b = 

■\JY1^=i ^"j- -^o^ ^ matrix A = [aij) G iR^^^, we define the elementwise l^o norm 
|A|oo = maxi<j<p^i<j<g I ajj I, the spectral norm ||A||2 = sup|x|2<i |Ax|2, the matrix 
£i norm = niaxi<j<g ^^^^^ |ajj|, the Frobenius norm \\A\\f = ^Ij^ 

the elementwise ii norm || A||i = XliLi Yl'j=i I'^ijl- ^ denotes a.pxp identity matrix. 
For any two index sets T and T' and matrix A, we use Aj,^/ to denote the |T| x |T'| 
matrix with rows and columns of A indexed by T and T' respectively. The notation 
A :^ means that A is positive definite. 

We now define our CLIME estimator. Let {^i} be the solution set of the 
following optimization problem: 

min subject to: |S„ri - J|oo < A^, ne Mf""^, (1) 

where A„ is a tuning parameter. In ([1]), we do not impose the symmetry condition 
on f2 and as a result the solution is not symmetric in general. The final CLIME 
estimator of fio is obtained by symmetrizing Cli as follows. Write fii = {Cj}j) = 
(a; J, . . . , Ljp). The CLIME estimator f2 of Qq is defined as 

n = (uij), where Uij = uji = uljl{\ulj\ < + u]j{\ulj\ > (2) 

In other words, between ujj and we take the one with smaller magnitude. It is 
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clear that is a symmetric matrix. Moreover, Theorem [T] shows that it is positive 
definite with high probabihty. 

The convex program ([T]) can be further decomposed into p vector minimization 
problems. Let be a standard unit vector in with 1 in the i-th coordinate and 
in all other coordinates. For 1 < i < p, let (3^ be the solution of the following 
convex optimization problem 

min|/3|i subject to |5]„/3- e^loo < A„, (3) 

where /3 is a vector in M^. The following lemma shows that solving the optimiza- 
tion problem ([1]) is equivalent to solving the p optimization problems (|3]). That is, 
{Oi} = {B} := {(/3i, . . . , $p)}- This simple observation is useful both for imple- 
mentation and technical analysis. 

Lemma 1 Let {Cli} be the solution set of ([IP and let {B} := {(/3i, . . . , 0p)} where 
^- are solutions to ^ for i = 1, ■■■,P- Then {tli} = {B}. 

To illustrate the motivation of ([T]), let us recall the method based on ii regular- 
ized log-determinant program (cf. d'Aspremont et al. (2008), Friedman et al. (2008), 
Banerjee et al. (2008)) as follows, which shall be called Glasso after the algorithm 
that efficiently computes the solution, 

f^Giasso := argmin{(f2, S„) - logdet(ri) + A„||ri||i}. (4) 

The solution riciasso satisfies 

^Glasso ~ = 
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where Z is an element of the sub differential 9||f2Giasso||i- This leads us to consider 
the optimization problem: 

min ||ri||i subject to: \Qr^ - S„|oo < A„, 17 G M"''^ . (5) 

However, the feasible set in ([5]) is very complicated. By multiplying the constraint 
with O, such a relaxation of ([5]) leads to the convex optimization problem ([1]), which 
can be easily solved. Figured] illustrates the solution for recovering a 2 by 2 precision 
matrix and we only consider the plane x(= y) vs z for simplicity. The point 

where the feasible polygon meets the dashed diamond is the CLIME solution fl. 
Note that the log-likelihood function as in Glasso is a smooth curve as compared to 
the polygon constraint in CLIME. 

3 Rates of Convergence 

In this section we investigate the theoretical properties of the CLIME estimator 
and establish the rates of convergence under different norms. Write S„ = [pij) = 
((Ji, . . . , (jp), So = (o"° ) and EX = (/ii, . . . , /Xp). It is conventional to divide the 
technical analysis into two cases according to the moment conditions on X. 

(CI). (Exponential-type tails) Suppose that there exists some < ?7 < 1/4 
such that log p/n < rj and 

Ee^(^'-^'^^ <K<oo for all |t| < rj, for all i, 

where is a bounded constant. 
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z 




Figure 1: Plot of the elementwise ioo constrained feasible set (shaded polygon) and 
the elementwise ii norm objective (dashed diamond near the origin) from CLIME. 
The log-likelihood function as in Glasso is shown by the dotted line. 

(C2). (Polynomial-type tails) Suppose that for some 7, Ci > 0, p < Cin"' , 

and for some 5 > 

E\Xi - < K foralH. 

For £i-MLE type estimators, it is typical that the convergence rates in the case 
of polynomial-type tails are much slower than those in the case of exponential-type 
tails. See, e.g., Ravikumar et al. (2008). We shall show that our CLIME estimator 
attains the same rates of convergence under either of the two moment conditions, 
and significantly outperforms £i-MLE type estimators in the case of polynomial-type 
tails. 
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3.1 Rates of convergence under spectral norm 

We begin by considering the uniformity class of matrices: 



U 



V 

:=W(g,So(p)) = |f2:f^ ^0, < M, max V < So(p)| 



for < g < 1, where 17 =: (uJij) = (cJi, . . . , cjp). Similar parameter spaces have 
been used in Bickel and Levina (2008b) for estimating the covariance matrix Sq . 
Note that in the special case of g = 0, W(0, So{p)) is a class of So(p)-sparse matrices. 
Let 



2 

6 = max E 



{Xi- fii){Xj - fij) - al 



--: maxOij. 



The quantity 6^ is related to the variance of aij, and the maximum value 6 captures 
the overall variability of S„. It is easy to see that under either (CI) or (C2) 6* is a 
bounded constant depending only on 7, 6, K. 

The following theorem gives the rates of convergence for the CLIME estimator 
under the spectral norm loss. 

Theorem 1 Suppose that fio G ■5o(p))- 

(i). Assume (CI) holds. Let A„ = CqM ^J\ogp/n, where Co = 2?7~^(2 + r + 
^-ig2^2^2 ^ ^ 0. Then 

< CiM2-^%(p)(i^)^'"'^^', (6) 

with probability greater than 1 — 4p"^, where Ci < 2(1 + 2^"'^ + 3^^'')4^~'^Cq~'^ . 
(li). Assume (C2) holds. Let A„ = C2M^J\ogp/n, where C2 = a/(5 + t){9 + 1). 
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Then 



\\n - flolh < CsM'-'^soip) (7) 

with probability greater than 1 — o(^n^^^^ + p^'^/^^, where C3 < 2(1 + 2^^" + 
3i-«)4i-9C'i-'?. 

When M does not depend on n,p, the rates in Theorem [1] are the same as 
those for estimating Sq in Bickel and Levina (2008b). In the polynomial- type 
tails case and when g = 0, the rate in ([7j) is significantly better than the rate 
(^^oip) \J ^^'''^^n" ~~~) £i-MLE estimator obtained in Ravikumar et al. (2008). 

It would be of great interest to get the convergence rates for supfj^g^^ E||r2 — f2o||2- 
However, it is even difficult to prove the existence of the expectation of 11^2 — f^olli 
as we are dealing with the inverse matrix. We modify the estimator Q, to ensure the 
existence of such expectation and the same rates are established. Let {f2ip} be the 
solution set of the following optimization problem: 

min||f2||i subj < A„, O e iR^^^ (8) 

where S„p = S„ + pi with p > 0. Write Oip = (c^^jp)- Define the symmetrized 
estimator fip as in ([2]) by 

= i'^ijp), where Uijp = ujip = uljpl{\uljp\ < [wj^pl} + u],pl{\uljp\ > \uj]ip\}- (9) 

Clearly S^p is a feasible point, and thus we have ||f2ip||Li < ||S^p||li < P~V- The 
expectation E||r2p — O0II2 is then well-defined. The other motivation to replace I]„ 
with S„ p comes from our implementation, which computes ([1]) by the primal dual 
interior point method. One usually needs to specify a feasible initialization. When 
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p > n, it is hard to find an initial value for ([T]). For ([8]), we can simply set the initial 
value to S~i, 



'n,p- 



Theorem 2 Suppose that fig G sq{p)) and (CI) holds. Let A„ = CqM \J\ogpJn 
with Co being defined in Theorem U\ (i) and r being sufficiently large. Let p = 



\/\og p/n. If p > for some ^ > 0, then we have 

sup E\\n^ - QoWl = 0(m^-^'1sI{p)C^^^'~' 



Remark: It is not necessary to restrict p = ^J\ogp/n. In fact, from the proof we 
can see that Theorem [2] still holds for 



mm(y ,p )<p<\ (10) 

n \ n 



with any a > 0. 

When the variables of X are ordered, better rates can be obtained. Similar as 
in Bickel and Levina (2008a), we consider the following class of precision matrices: 

Uo{a,B) = ^^^l:nyO, max^{\uJij\:\i-j\>k}<B{k + l)-"hrallk>0^ 

for a > 0. Suppose the modified Cholesky factor of fio is ^^o = TD~^T, with the 
unique lower triangular matrix T and diagonal matrix D. To estimate fio, Bickel 
and Levina (2008a) used the banding method and assumed T ^Uo{<y,B). It is easy 
to see that T G V(o{a, B) implies fio ^ Uo{<^i Bi) for some constant Bi. Rather than 
assuming T E Uo{a, B), we use a more general assumption that flo E Uo{a, B). 

Theorem 3 Let Qq GUo{a, B) and A„, = CB^J\ogpJn with sufficiently large C . 
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(i). If (CI) or (C2) holds, then with probability greater than 1—0 (n ^/^-\-p' 



r/2 



) 



'0||2 — 




) 



(11) 



(a). Suppose that p > for some > 0. If (CI) holds and p = \J\ogp/n, then 



sup E\\hp-n4l = o(B- 




) 



(12) 



SioeUoia,B) 



Theorem [3] shows that our estimator has the same rate as that in Bickel and Lev- 
ina (2008a) by banding the Cholesky factor of the precision matrix for the ordered 
variables. 

3.2 Rates under norm and Frobenius norm 

We have so far focused on the performance of the estimator under the spectral norm 
loss. Rates of convergence can also be obtained under the elementwise Zoo norm and 
the Frobenius norm. 

Theorem 4 (i). Under the conditions of TheoremUl (i), we have 




with probability greater than 1 — Ap 



(a). Under the conditions of TheoremUl (H), we have 
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with probability greater than 1 — Oin -\- p "^1 



The rate in Theorem |l](ii) is significantly faster than the one obtained by Raviku- 
mar et al. (2008); see Section 3.3 for more detailed discussions. A similar rate to 
ours was obtained by Lam and Fan (2009) under the Frobenius norm. The elemen- 
twise £oo norm result will lead to the model selection consistency result to be shown 
in the next section. We now give the rates for tip — fio under expectation. 

Theorem 5 Under the conditions of Theorem\^ we have 

i^ogp- 



sup E|0,-f2olL = o(M^- 



- sup E\\(l, - = o(M'~'^so{p)C-^y"'^' 

The proofs of Theorems 1-5 rely on the following more general theorem. 

Theorem 6 Suppose that Qq G U{q, so{p)) and p > 0. If \n > ||f^o||Li(niax,jj |a 
cr°- \ + p), then we have 



\np-no\oo<4:\\^o\\L,\n, (13) 



^p-^o\\2<CMp)K~^ (14) 



and 



-Wp-fto\\l<CMp)^l-' (15) 
P 



where C4 < 2(1 + 2^"'? + 3^^i){A\\no\\L,y-'' and C5 < 4||rio||LiC'4. 
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3.3 Comparison with lasso-type estimator 

We compare our results to those of Ravikumar et al. (2008), wherein the authors 
estimated Oq by solving the following £i regularized log- determinant program: 

fl^ := argmin{(r2, S„) - logdet(r2) + A„||r2||i,off}, (16) 

where ||r2||iofr = Xli^^^j To obtain the rates of convergence in the elementwise 
ioo norm and the spectral norm, they imposed the following condition: 

Irrepresentable Condition in Ravikumar et al. (2008) There exists some 
a G (0, 1] such that 

l|rse5(r55)"lL, < 1-a, (17) 

where T = Ti^^ ® •S' is the support of flo and = {1, . . . ,p} x {1, . . . ,p} — S. 

The above assumption is particularly strong. Under this assumption, it was 
shown in Ravikumar et al. (2008) that Cl^ estimates the zero elements of fio exactly 
by zero with high probability. In fact, a similar condition to f[T7|) for Lasso with 
the covariance matrix So taking the place of the matrix F is sufficient and nearly 
necessary for recovering the support using the ordinary Lasso; see for example Mein- 
shausen and Biihlmann (2006). 

Suppose that fio is So(p)-sparse and consider subgaussian random variables 
Xi/ y/o^ with the parameter cr. In addition to (fT7|) . Ravikumar et al. (2008) assumed 
that the sample size n satisfies the bound 

n > Cisl{p){l + 8/a)2(r logp + log 4), (18) 
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where Ci = {48^2(1 + Aa^)maxi{al)max{\\i:o\\L,Kr,\\^o\\l^Kl.}}\ Under the 
aforementioned conditions, they showed that with probabihty greater than 1 — 

- f^oloo < {16v^(l + 4a^) max(a,,)(l + 8a-^)Kr} 

where = ||([So ® So]5s)~^||li. Note that their constant depends on quantities 
a and Kr, while our constant depends on M, the bound of UriolUi- They required 
( ITSj) . while we only need logp = o(n). Another substantial difference is that the 
irrepresentable condition ([T7|) is not needed for our results. 

We next compare our result to that of Ravikumar et al. (2008) under the case of 
polynomial-type tails. Suppose (C2) holds. Corollary 2 in Ravikumar et al. (2008) 
shows that if p = O (^{n/ Sq{p)}^'^^^^^^^^^'^^ for some r > 2, then with probability 
greater than 1 — l/p"^"^, 

\n^-noU = o[f—- ). 

Theorem H] shows our estimator still enjoys the order of y^\ogp/n in the case of 
polynomial-type tails. Moreover, when 7 > 1, the range p = 0{n'^) in our theorem 
is wider than their range p = o(^{n/sl{p)}^"''^^^^^^^^'^^ with r > 2. 

It is worth noting that instead of the sparse precision matrices, our estimator 
allows for a wider class of matrices. For example, the estimator is still consistent 
for the model which is not truly sparse but has many small entries. 



rlogp + log 4 
V n 
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4 Graphical Model Selection Consistency 

As mentioned in the introduction, graphical model selection is an important prob- 
lem. The constrained ii minimization procedure introduced in Section [2] for esti- 
mating fio can be modified to recover the support of fio- We introduce an addi- 
tional thresholding step based on f2. More specifically, define a threshold estimator 
rt = {uij) with 

where r„ > 4MA„ is a tuning parameter and A„ is given in Theorem [TJ 
Define 



M{n) = {sgn{uij), 1 < i,j < p}, 
A^(f2o) = {sgn(4), l<z,j<p}, 

^(f2o) = {(^,j):4^0}' 



= min 

{i,j)€Sino) 



From the elementwise i^o results established in Theorem HI with high probability, 
the resulting elements in f2 shall exceed the threshold level if the corresponding 
element in fio is large in magnitude. On the contrary, the elements of f2 outside the 
support of Qq will remain below the threshold level with high probability. Therefore, 
we have the following theorem on the threshold estimator Q. 

Theorem 7 Suppose that (CI) or (C2) holds and Q,q G W(0, So(p)). If O^i^ > 2r„, 
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then with probability greater than 1 — Oyn ^/^ -\-p ^/^j, we have A^(f2) = A^(r2o). 

The threshold estimator fl not only recovers the sparsity pattern of f2o, but also 
recovers the signs of the nonzero elements. This property is called sign consistency 
in some literature. 

The condition ^min > 27:„ is needed to ensure that nonzero elements are correctly 
retained. From Theorem HI we see that, if M does not depend on n,p, then r„ 
is of order ^yiogp/n which is the same order as in Ravikumar et al. (2008) for 
exponential-type tails, but weaker than their assumption Oj^in > q ^J^^^2^E^IK for 
polynomial-type tails. 

Based on Meinshausen and Biihlmann (2006), Zhou et al. (2009) applied adaptive 
Lasso to covariance selection in Gaussian graphical models. For X = (Xi, . . . , Xp) ~ 
A^(0, So), they regress Xi versus the other variables {Xk, k ^ i\. Xi = t^j-^j + 
Vi, where Vi is a normally distributed random variables with mean zero and the 
underlying coefficients can be shown to be /3* = — Then they use the 
adaptive Lasso to recover the support of {/?]}, which is identical to the support 
of fio- A main assumption in their paper is the restricted eigenvalue assumption on 
So which is weaker than the irrepresentable condition. Their method can recover 
the support of fio but is unable to estimate the elements in fio. Without imposing 
the unnecessary irrepresentable condition, the additional advantage of our method 
is that it not only recovers the support of flo but also provides consistency results 
under the elementwise l^o norm and the spectral norm. 

5 Numerical Results 

In this section we turn to the numerical performance of our CLIME estimator. The 
procedure is easy to implement. An R package of our method has been developed 
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and is available on the web at 

http : //stat . Wharton. upenn. edu/~tcai/paper/htinl/Precision-Matrix .html. 

The goal of this section is to first investigate the numerical performance of the 
estimator through simulation studies and then apply our method to the analysis of 
a breast cancer dataset. 

The proposed estimator Cl can be obtained in a column by column fashion as 
illustrated in Lemma 1. Hence we will focus on the numerical implementation of 
solutions to the optimization problem ([3]): 

min |/3|i subject to |S„,/3 — ej|oo < A„. 

We consider relaxation of the above, which is equivalent to the following linear 
programming problem: 

p 

min Uj 

i=i 

subject to: — f3j < Uj for all 1 < j < p 

(19) 

+ (3j < Uj for all 1 < j < p 

- alf3 + I{k = i} < Xn for all 1 < A; < p 

+ - I{k = i} <Xn for all 1 < A; < p. 

The same linear relaxation was considered in Candes and Tao (2007), and was 
shown there to be very efficient for the Dantzig selector problem in regression. To 
solve f|T9l) . we follow the primal dual interior method approach, for example see 
Boyd and Vandenberghe (2004). The resulting algorithm has comparable numerical 
performance as other numerical procedures, for example Glasso. Note that we only 
need sweep through the p columns once but Glasso does need to have an extra outer 
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layer of iterations to loop through the p columns several times by cyclical coordinate 
descent. Once fli is obtained by combining the /3's for each column, we symmetrize 
Cli by setting the entry (i, j) to be the smaller one in magnitude of two entries Cj]j 
and cjjj, for all 1 < j < as in (|2]). 

Similar to many iterative methods, our method also requires a proper initializa- 
tion within the feasible set. The initializing (3^ however cannot be simply replaced 
by the solution of the linear system Sn/3 = for each i when p > n, since is 
singular. The remedy is to add a small positive constant p (e.g. p = ^/\og p/n) 
to all the diagonal entries of the matrix S„, that is we use the p-perturbed matrix 
S„^p = S„ + pi to replace the S„ in (fT9|) . Such a perturbation does not noticeably 
affect the computational accuracy of the final solution in our numerical experiments. 
The resulting solution Clp in the perturbed problem (|8]) is shown to have all the the- 
oretical properties in Sections 3 and 4, and even better the convergence rate of the 
spectral norm under expectation is also established there for Q,p. 

In the context of high dimensional linear regression, a second stage refitting pro- 
cedure was considered in Candes and Tao (2007) to correct the biases introduced by 
the ii norm penalization. Their refitting procedure seeks the best coefficient vector, 
giving the maximum likelihood, which has the same support as the original Dantzig 
selector. Inspired by this two-stage procedure, we propose a similar two-stage pro- 
cedure to further improve the numerical performance of the CLIME estimator by 
refitting as 

O = argmin{(r2, S„) — logdet(r2)} 

where 5* = S{Q,) and fi^c = {wjj, («, j) G S^}. Here the estimator Cl minimizes 
the Bregman divergence among all symmetric positive definite matrices under the 
constraint. We shall call Cl Refitted CLIME hereafter. The bounds under the three 
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norms in Section [3] and the support recovery S{Cl) = S{fto) can also be established. 
For example, the Frobenius loss bound can be easily derived from the same approach 
used in Rothman et al. (2008) and Fan et al. (2009). Other theoretical properties 
are more involved and we leave this to future work. 

5.1 Simulations 

We now compare the numerical performance of the CLIME estimator ficLiME; the 
Refitted CLIME estimator, the Graphical Lasso f^Giasso and the SCAD fiscAD from 
Fan et al. (2009) which is defined as 

p p 

^^SCAD := argmin{(f2,E„) -logdet(f2) + V VSCADA,a(|c^^il)}. 

1=1 ]=\ 

where the SCAD function SCADA,a is proposed by Fan (1997). We use recommended 
choice a = 3.7 by Fan and Li (2001) throughout and set all A to be the same for 
all entries for simplicity. This setting for a and A is the same as that of Fan 
et al. (2009). See Fan et al. (2009) for further details on fiscAD- Note that r^ciasso 
has the equivalent performance as the SPICE estimator by Rothman et al. (2008) 
according to their study. 

We consider three models as follows: 

• Model 1. = 0.6l^"^l. 

• Model 2. The second model comes from Rothman et al. (2008). We let 
rio = B + 61, where each off-diagonal entry in B is generated independently 
and equals to 0.5 with probability 0.1 or with probability 0.9. 6 is chosen 
such that the conditional number (the ratio of maximal and minimal singular 
values of a matrix) is equal to p. Finally, the matrix is standardized to have 
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unit diagonals. 

• Model 3. In this model, we consider a non-sparse matrix and let Q,q have all 
off-diagonal elements 0.5 and the diagonal elements 1. 

The first model has a banded structure, and the values of the entries decay as they 
move away from the diagonal. The second is an example of a sparse matrix without 
any special sparsity patterns. The third serves as a dense matrix example. 

For each model, we generate a training sample of size n = 100 from a multivariate 
normal distribution with mean zero and covariance matrix Sq, and an independent 
sample of size 100 from the same distribution for validating the tuning parameter 
A. Using the training data, a series of estimators with 50 different values of A are 
computed, and the one with the smallest likelihood loss on the validation sample is 
used, where the likelihood loss is defined by 

n) = {n, s) - logdet(ri). 

The Glasso and SCAD estimators are computed on the same training and testing 
data using the same cross validation scheme. We consider different values of p = 
30, 60, 90, 120, 200 and rephcate 100 times. 

The estimation quality is first measured by the following matrix norms: the 
operator norm, the matrix ii norm and the Frobenius norm. Table [1] reports the 
averages and standard errors of these losses. 

We see that CLIME nearly uniformly outperforms Glasso. The improvement 
tends to be slightly more significant for sparse models when p is large, but overall the 
improvement is not dramatic. Among the three methods, SCAD is computationally 
most costly, but numerically it has the best performance among the three when 
p < n and is comparable to CLIME when p is large. Note that SCAD employs a 
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Table 1: Comparison of average(SE) matrix losses for three models over 100 replications. 



Operator norm 







Model 1 






Model 2 






Model 3 




p 


f^CLIME 


^Glasso 


f^SCAD 


f^CLIME 


^Glasso 


f^SCAD 


^^CLIME 


^Glasso 


f^SCAD 


30 


2.28(0.02) 


2.48(0.01) 


2.38(0.02) 


0.74(0.01) 


0.77(0.01) 


0.59(0.02) 


14.95(0.004) 


14.96(0.004) 


14.97(0.002) 


60 


2.79(0.01) 


2.93(0.01) 


2.71(0.01) 


1.13(0.01) 


1.12(0.01) 


0.95(0.01) 


30.01(0.002) 


30.02(0.002) 


29.98(0.001) 


90 


2.97(0.01) 


3.07(0.004) 


2.76(0.004) 


1.69(0.01) 


1.49(0.004) 


1.14(0.01) 


45.01(0.002) 


45.03(0.001) 


44.98(0.001) 


120 


3.08(0.004) 


3.14(0.003) 


2.79(0.004) 


2.16(0.01) 


1.82(0.003) 


1.38(0.01) 


60.01(0.002) 


60.04(0.001) 


58.40(0.10) 


200 


3.17(0.01) 


3.25(0.002) 


2.83(0.003) 


2.36(0.01) 


2.46(0.002) 


2.11(0.01) 


100.02(0.001) 


100.08(0.001) 


96.69(0.01) 


Matrix £i-norm 






Model 1 






Model 2 






Model 3 




P 


f^CLIME 


^Glasso 


f^SCAD 


f^CLIME 


^Glasso 


f^SCAD 


f^CLIME 


^Glasso 


f^SCAD 


30 


2.91(0.02) 


3.08(0.01) 


2.91(0.02) 


1.29(0.02) 


1.36(0.01) 


0.81(0.02) 


15.12(0.004) 


15.08(0.003) 


15.10(0.002) 


60 


3.32(0.01) 


3.55(0.01) 


3.11(0.01) 


2.10(0.02) 


2.11(0.02) 


1.98(0.03) 


30.17(0.002) 


30.15(0.002) 


30.12(0.002) 


90 


3.44(0.01) 


3.72(0.01) 


3.19(0.01) 


2.95(0.02) 


2.87(0.02) 


2.71(0.03) 


45.18(0.002) 


45.18(0.002) 


45.13(0.002) 


120 


3.48(0.01) 


3.81(0.01) 


3.24(0.01) 


3.69(0.02) 


3.33(0.02) 


3.32(0.03) 


60.20(0.002) 


60.20(0.003) 


60.55(0.06) 


200 


3.55(0.01) 


4.01(0.01) 


3.37(0.01) 


4.13(0.02) 


4.52(0.02) 


4.67(0.03) 


100.22(0.002) 


100.24(0.002) 


102.64(0.05) 


Frobenius norm 






Model 1 






Model 2 






Model 3 




P 


f^CLIME 


f^Glasso 


f^SCAD 


f^CLIME 


f^Glasso 


S^SCAD 


f^CLIME 


^Glasso 


f^SCAD 


30 


3.81(0.04) 


4.23(0.03) 


3.97(0.03) 


1.72(0.02) 


1.71(0.01) 


1.23(0.02) 


14.96(0.004) 


14.97(0.004) 


14.97(0.001) 


60 


6.63(0.03) 


7.14(0.02) 


6.37(0.02) 


3.33(0.02) 


3.10(0.01) 


3.11(0.01) 


30.02(0.002) 


30.02(0.002) 


29.98(0.001) 


90 


8.78(0.04) 


9.25(0.01) 


7.98(0.01) 


4.92(0.02) 


4.36(0.01) 


4.51(0.01) 


45.02(0.002) 


45.04(0.001) 


44.99(0.001) 


120 


10.58(0.02) 


10.97(0.01) 


9.31(0.01) 


6.50(0.03) 


5.50(0.01) 


5.89(0.01) 


60.01(0.001) 


60.05(0.001) 


60.60(0.08) 


200 


14.20(0.04) 


14.85(0.01) 


12.21(0.01) 


7.57(0.02) 


8.15(0.01) 


8.41(0.01) 


100.02(0.001) 


100.08(0.001) 


103.41(0.02) 



nonconvex penalty to correct the bias while CLIME currently optimizes the convex 
£i norm objective efficiently. A more comparable procedure that also corrects the 
bias is our two-stage Refitted CLIME, denoted by f^R-cLiME- Table [2] illustrates 
the improvement from bias correction, and we only list the spectral norm loss for 
reasons of space. It is clear that our Refitted CLIME estimator has comparable 
or better performance than SCAD, and our Refitted CLIME is especially favorable 
when p is large. 

Table 2: Comparison of average(SE) operator norm losses from Model 1 and 2 over 
100 replications. 





Model 1 


Model 2 


p 


f^R-CLIME 


f^SCAD 


S^R-CLIME 


f^SCAD 


30 


1.56(0.02) 


2.38(0.02) 


0.85(0.01) 


0.59(0.02) 


60 


2.15(0.01) 


2.71(0.01) 


1.14(0.01) 


0.95(0.09) 


90 


2.42(0.01) 


2.76(0.004) 


1.17(0.01) 


1.14(0.01) 


120 


2.56(0.01) 


2.79(0.004) 


1.44(0.01) 


1.38(0.01) 


200 


2.71(0.01) 


2.83(0.003) 


1.91(0.01) 


2.11(0.01) 



Gaussian graphical model selection has also received considerable attention in 
the literature. As we discussed earlier, this is equivalent to the support recovery of 
the precision matrix. The proportion of true zero (TN) and nonzero (TP) elements 
recovered by two methods are also reported here in Table [3l The numerical values 
over 10"'^ in magnitude are considered to be nonzero since the computation accuracy 
is set to be 10~^. 

It is noticeable that Glasso tends to be more noisy by including erroneous nonzero 
elements; CLIME tends to be more sparse than Glasso, which is usually favorable 
in real applications; SCAD produces the most sparse among the three but with a 
price of erroneously estimating more true nonzero entries by zero. This conclusion 
can also be reached in Figure Ej where the TPR and FPR values of 100 realizations 
of these three procedures for first two models (note that all elements in Model 3 are 
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Table 3: Comparison of average(SE) support recovery for three models over 100 replications. 



TN% 







Model 1 






Model 2 






Model 3 




p 


f^CLIME 


^Glasso 


f^SCAD 


f^CLIME 


^Glasso 


f^SCAD 


f^CLIME 


^^Glasso 


f^SCAD 


30 


78.69(0.61) 


50.65(0.75) 


99.26(0.17) 


77.41(0.86) 


64.70(0.42) 


99.10(0.08) 


N/A 


N/A 


N/A 


60 


90.37(0.27) 


69.47(0.29) 


99.86(0.03) 


85.98(0.36) 


69.44(0.21) 


96.08(0.14) 


N/A 


N/A 


N/A 


90 


94.30(0.27) 


77.62(0.20) 


99.88(0.02) 


91.15(0.17) 


71.57(0.15) 


95.98(0.11) 


N/A 


N/A 


N/A 


120 


96.45(0.06) 


81.46(0.16) 


99.91(0.01) 


94.87(0.19) 


75.33(0.10) 


95.69(0.10) 


N/A 


N/A 


N/A 


200 


97.41(0.11) 


85.36(0.11) 


99.92(0.01) 


81.74(0.26) 


66.07(0.12) 


96.97(0.05) 


N/A 


N/A 


N/A 


TP% 






Model 1 






Model 2 






Model 3 




P 


f^CLIME 


f^Glasso 


f^SCAD 


f^CLIME 


^Glasso 


f^SCAD 


f^CLIME 


f^Glasso 


S^SCAD 


30 


41.07(0.58) 


60.20(0.56) 


16.93(0.28) 


99.66(0.09) 


99.98(0.02) 


97.70(0.24) 


14.88(0.50) 


20.07(0.57) 


3.38(0.001) 


60 


25.96(0.30) 


41.72(0.32) 


12.72(0.15) 


85.10(0.36) 


96.47(0.13) 


79.81(0.44) 


6.86(0.05) 


10.49(0.20) 


1.67(0.001) 


90 


20.32(0.32) 


33.70(0.23) 


11.94(0.09) 


66.25(0.39) 


91.62(0.15) 


67.93(0.48) 


5.86(0.03) 


7.54(0.13) 


1.11(0.001) 


120 


17.16(0.09) 


29.32(0.20) 


11.57(0.07) 


42.37(0.49) 


82.45(0.15) 


54.92(0.41) 


5.11(0.02) 


6.20(0.12) 


20.63(2.47) 


200 


15.03(0.13) 


25.34(0.15) 


11.07(0.06) 


57.07(0.27) 


73.43(0.14) 


30.50(0.40) 


3.56(0.01) 


4.94(0.02) 


39.76(0.02) 



nonzero) are plotted for p = 60 as a representative example of other cases. 




0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 

False Positive Rate False Positive Rate 



(a) Model 1 (b) Model 2 

Figure 2: TPR vs FPR for p = 60. The solid, dashed and dotted lines are the average 
TPR and FPR values for CLIME, Glasso and SCAD respectively as the tuning 
parameters of these methods vary. The circles, triangles and pluses correspond to 
100 different realizations of CLIME, Glasso and SCAD respectively, with the tuning 
parameter picked by cross validation. 

To better illustrate the recovery performance elementwise, the heatmaps of the 
nonzeros identified out of 100 replications are pictured in Figure El All the heatmaps 
suggest that CLIME is more sparse than Glasso, and by visual inspection the spar- 
sity pattern recovered by CLIME has significantly better resemblance to the true 
model than Glasso. When the true model has significant nonzero elements scattered 
on the off diagonals, Glasso tends to include more nonzero elements than needed. 
SCAD produces the most sparse among the three but could again zero out more 
true nonzero entries as shown in Model 1. Similar patterns are observed in our 
experiments for other values of p. 

5.2 Analysis of a breast cancer dataset 

We now apply our method CLIME on a real data example. The breast cancer data 
were analyzed by Hess et al. (2006) and are available at 
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Model 1 




\ 




\ 



(a) Truth 



(b) CLIME 



(c) Glasso 



(d) SCAD 



Model 2 




(e) Truth 



(f) CLIME 



(g) Glasso 



(h) SCAD 



Figure 3: Heatmaps of the frequency of the zeros identified for each entry of the 
precision matrix (when p = 60) out of 100 replications. White color is 100 zeros 
identified out of 100 runs, and black is 0/100. 

http://bioinformatics.mdanderson.org/. The data set consists of 22283 gene 
expression levels of 133 subjects, 34 of which have achieved pathological complete 
response (pCR) and the rest with residual disease (RD). The pCR subjects are 
considered to have high chance of cancer-free survival in the long term, and thus 
it is of great interest to study the response states of the patients (pCR or RD) to 
neoadjuvant (preoperative) chemotherapy. Based on the estimated inverse covari- 
ance matrix of the gene expression levels, we apply the linear discriminant analysis 
(LDA) to predict whether a subject can achieve the pCR state or not. 

For a fair comparison with other methods on estimating the inverse covariance 
matrix, we follow the same analysis scheme discussed in Fan et al. (2009) and the 
references therein. For completeness, we here give a brief description of these steps. 
The data are randomly divided into the training and the testing data sets. A 
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stratified sampling approach is applied to divide the data, where 5 pCR subjects 
and 16 RD subjects are randomly selected to constitute the testing data (roughly 
1/6 of the subjects in each group). The remaining subjects form the training set. 
On the training set, a two sample t test is performed between the two groups for 
each gene, and the 113 most significant genes (smallest values) are retained as the 
covariates for prediction. Note that the size of the training sample is 112, one less 
than the variable size, hence it allows us to examine the performance when p > n. 
The gene data are then standardized by the estimated standard deviation, estimated 
from the training data. Finally, following the LDA framework, the normalized gene 
expression data are assumed to be normally distributed as N{^i^, S), where the two 
groups are assumed to have the same covariance matrix S but different means fif,, 
k = 1 for pCR and /c = 2 for RD. The estimated inverse covariance Cl produced by 
different methods is used in the linear discriminant scores 



where ifk = Uk/n is the proportion of group k subjects in the training set and 
fij^ = (l/'T-fc) Siegroup fe is within-group average vector in the training set. 



The classification performance is clearly associated with the estimation accuracy 
of fl. We use the testing data set to assess the estimation performance and compare 
with the existing results in Fan et al. (2009) using the same criterion. For the tuning 
parameters, we use a 6 fold cross validation on the training data for picking A. The 
above estimation scheme is repeated 100 times. 

To compare the classification performance, specificity, sensitivity and Mathews 




(20) 



The classification rule is taken to be k{x) = argmax5;^.(a^) for k = 1,2. 
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Correlation Coefficient (MCC) criteria are used, which are defined as follows: 

TN TP 
Specificity = ^j^^pp , Sensitivity = Tf^r^^ 

TPxTN-FPxFN 
v/(TP + FP)(TP + FN)(TN + FP)(TN + FN) ' 

where TP and TN stand for true positives (pCR) and true negatives (RD) respec- 
tively, and FP and FN for false positives/negatives. The larger the criterion value, 
the better the classification performance. The averages and standard errors of the 
above criteria along with the number of nonzero entries in f2 over 100 replications 
are reported in Table |H The Glasso, Adaptive lasso and SCAD results are taken 
from Fan et al. (2009), which uses the same procedure on the same data set, except 
that we here use ficLiME in place of fl. 



Table 4: Comparison of average(SE) pCR classification errors over 100 replications. 
Glasso, Adaptive lasso and SCAD results are taken from Fan et al. (2009), Table 2. 



Method 


Specificity 


Sensitivity 


MCC 


Nonzero entries in f2 


Glasso 


0.768(0.009) 


0.630(0.021) 


0.366(0.018) 


3923(2) 


Adaptive lasso 


0.787(0.009) 


0.622(0.022) 


0.381(0.018) 


1233(1) 


SCAD 


0.794(0.009) 


0.634(0.022) 


0.402(0.020) 


674(1) 


CLIME 


0.749(0.005) 


0.806(0.017) 


0.506(0.020) 


492(7) 



It is clear that CLIME significantly outperforms on the sensitivity and is com- 
parable with other two methods on the specificity. The overall classification perfor- 
mance measured by MCC overwhelmingly favors our method CLIME, which shows 
an 25% improvement over the best alternative methods. CLIME also produced 
the most sparse matrix than all other alternatives, which is usually favorable for 
interpretation purposes on real data sets. 
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6 Discussion 



This paper develops a new constrained ii minimization method for estimating high 
dimensional precision matrices. Both the method and the analysis are relatively 
simple and straightforward, and may be extended to other related problems. More- 
over, the method and the results are not restricted to a specific sparsity pattern. 
Thus the estimator can be used to recover a wide class of matrices in theory as well 
as in applications. In particular, when applying our method to covariance selection 
in Gaussian graphical models, the theoretical results can be established without 
assuming the irrepresentable condition in Ravikumar et al. (2008), which is very 
stringent and hard to check in practice. 

Several papers, such as Yuan and Lin (2007), Rothman et al. (2008) and Raviku- 
mar et al. (2008), estimate the precision matrix by solving the optimization problem 
( IT6|) with ii penalty only on the off diagonal entries, which is slightly different from 
our starting point (j4]) presented here. One can also similarly considered the following 
optimization problem 

min ||ri||i,ofr subj |S„ri - I\^ < A„, ft E W""^ . 

Analogous results can also be established for the above estimator. We omit them 
in this paper, due to high resemblance in proof techniques and conclusions. 

There are several possible extensions for our method. For example, Zhou et 
al. (2008) considered the time varying undirected graphs and estimated S(t)~^ 
by Glasso. It would be very interesting to study the estimation of S(t)^^ by 
our method. Ravikumar and Wainwright (2009) considered high-dimensional Ising 
model selection using £i-regularized logistic regression. It would be interesting to 
apply our method to their setting as well. 
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Another important subject is to investigate the theoretical property of the tun- 
ing parameter selected by cross-validation method, though from our experiments 
CLIME is not very sensitive to the choice of the tuning parameter. An example 
of such results on cross validation can be found in Bickel and Levina (2008b) on 
thresholding. 

After this paper was submitted, it came to our attention that Zhang (2010) 
proposed a precision matrix estimator, called GMACS, which is the solution of the 
following optimization problem: 

min \\n\\L, subject to: |S„ri - /|oo < A„, fie M"''^. 

The objective function here is different from that of CLIME, and this basic version 
cannot be solved column by column and is not as easy to implement. Zhang (2010) 
considers only the Gaussian case and balls, whereas we consider subgaussian and 
polynomial-tail distributions and more general Iq balls. Also, the GMACS estimator 
requires an additional thresholding step in order for the rates to hold over £o balls. 
In contrast, CLIME does not need an additional thresholding step and the rates 
hold over general balls. 

7 Proof of Main Results 

Proof of Lemma [H Write fi = (cji, . . . , cUp), where uji G M^. The constraint 
|S„r2 — /|oo < An is equivalent to 

|I]„a;j - Ciloo < An for l<i<p. 
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Thus we have 

l^'li > lAli for 1 < « <p. (21) 
Since |S„B — I\ao < A„, by the definitions of {fii}, we have 

ll^illi < (22) 

Bv(l2B and ([22]), we have B e On the other hand, if (li ^ {B}, then there 

exists an i such that > Hence by ( pTl) we have ||r2i||i > ||B||i. This is 

in confiict with fl22|) . | 

The main results all rely on Theorem [6l which upper bounds the elementwise 
ioo norm. We will prove it first. 

Proof of Theorem [6l Let f3^ p be a solution of ([3]) by replacing S„ with S„,p. Note 
that Lemma [1] still holds for fl^ p and {/3jp} with p > 0. For notation briefness, we 
only prove the theorem for p = 0. The proof is exactly the same for general p > 0. 
By the condition in Theorem [6l 

|So — S„|oo < A„/||r2o||Li- (23) 

Then we have 

\I — S„r2o|oo = |(So — S„,)r2o|oo < llf^olUilSo — Sn|oo < A„, (24) 

where we used the inequality |AS|oo < | A|oo||-B||li for matrices A, B of appropriate 
sizes. By the definition of /3j, we can see that |/3j|i < ||f2o||Li for 1 ^ i ^ P- By 
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Lemma [H 



l^^ilU, < W^oh,. (25) 



We have 



- no)\oo < l^n^l - I\oo +\I- S„f2o|oo < 2A„. (26) 

Therefore by ([23])- ([26]), 

|So(f2i-f2o)|oo < |S„(f2i-f2o)|oo + |(5^n-So)(f^i-f^o)|oo 

< 2Xn + ll^il — riolUilSn — SqIoo < 4An. 

It follows that 

This establishes ( IT3]) by the definition in ([2]). 

We next prove f lT^ . Let t„ = |r2 — rioloo and define 



h] = {Cj,jI{\Cjij\ > 2tn}; l<i<pf - a;", h^j = hj - h]. 
By the definition ([2]) of f2, we have < I'^jli ^ I'^^jk- Then 

which implies that {h^li < I'^jli- This follows that < 2|/i^|i. So we only need 
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to upper bound \h}j\i. We have 

p 

i=l 

P P 



i=l i=l 



< (2t„)'~'So(p) +tn5^/{|Wiil > 2tn} + > 2tn} - I{\o0^j\ > 2tn}\ 



i=l i=l 
P P 



< (2t„)i-«so(p) + (t„?-''so(p) + (3t„)i-%(p) 

< (l + 2i~'' + 3i-'')t^-%(p), (27) 
where we used the following inequality: for any a, 6, c G iR, we have 

|J{a < c} - J{6 < c}| < I{\b - c| < |a - 

This completes the proof of (fT4|) . 

Finally, dU]) follows from 027]), ([13]) and the inequality || A|||, < p|| A||lJ A|oo for 
any p x p matrix. | 



Proof of Theorems [T] (i) and [4] (i). By Theorem El we only need to prove 

max l&ij - (T° | < CoVlogp/n (28) 

ij 

with probability greater than 1 — Ap~'^ under (CI). Without loss of generality, we 
assume EX = 0. Let S° := n"^ I]fc=i^fc^fc ^kij = XkiXkj - EXkiXkj. Then 
we have S„ = S° — XX"^. Let t = r|^J\ogpJn. Using the inequality — 1 — s| < 
^2gmax(s,o) f^j, ^^y. g £ g^^id letting Cki = 2 + r + ?7~^Ji ^, by basic calculations, we 
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can get 



k=l 

< exp - Cki log p + nt^ EV^i/^^"' 

< exp ^ - Cxi logp + ?7"^-ft'^ logpj 

< exp(-(r + 2) logp). 



Hence we have 



P(|S° - Soloo > v~'CKiV^ogp/nj < 2p-\ (29) 

By the simple inequality < e*^^^ for s > 0, we have Ee*'^^' < eK for all t < 77^/^. 
Let Ck2 = 2 + r + rf'^e^K"^ and a„ = C|'2(logp/?T-)^^^. As above, we can show that 



P(^|XX^|oo > 7? ^anVlogp/nj < pmaxPf y^Xfcj > rj ^CK2\/n log 

A:=l 

+p max P - ^ Xfci > r]~^CK2 log 



A:=l 

< 2p-^-\ (30) 
By ([29]), (jSO]) and the inequality Cq > V'^Cki + r] ^a„, we see that (1281) holds. | 

Proof of Theorems [T] (ii) and [4] (ii). Let 

Yki, = Xk^XkjI{\Xk^Xkj\ < y/n/{\ogpy} - EX,AjI{\Xk^X,,\ < y/n/{\ogpy'}, 

-y 'y \> 

^ kij -' kij ^ kij • 

Since := maxjj E\XkiXkj\I{\XkiXkj\ > y/n/(\ogpy} = 0(l)n~'^~^/^ we have by 
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(C2), 

n 

P ( max I y Ykij I > 2nbr, 
fc=i 

n 

< P ( max I XkiXkjI{\XkiXkj \ > ^/n/{\ogp)^}\ > nhj^ 

k=l 
n 

< P(^meixJ2\XkiXk,\I{Xl + Xl > 2^n/{\ogpf} > nbn) 

k=l 

< p(maxXl > >yn/{\ogp)A 

\ k,i / 

< pnP(^Xl > Vn/(logp)3j 

By Bernstein's inequality (cf. Bennett (1962)) and some elementary calculations, 

n 

P ( max I ^ Yki,\ > ^i9 + l){4 + T)n\ogp^ 

k=l 

n 

< p^ max P (^1 ^ Ykij I > + 1)(4 + T)n\ogp 



k=l 



< 2p^ max exp ( 



{6 + 1)(4 + T)n\ogp 

^ texp ^ - 



2nEY,% + v/(^ + 1)(64+ 16r)n/(31ogj)) 
0(l)p-"/l 



So we have 



P(|S° - Soloo > ViO + 1)(4 + r) \ogp/n + 26„) = O + p-^/^^ (3^) 
Using the same truncation argument and Bernstein's inequality, we can show that 

n 

P max I ^ Xfci I > y/maxa°^(4 + r)nlogp) = O (n'^^^ + 



k=l 
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Hence 



P(^|XX^|oo > maxa°^(4 + r) logp/nj = o(n~^/^ + p-^/^j. (32) 
Combining (EI]) and ([32]), we have 

max I ai, - a° | < ^/{e + 1){5 + T)hgp/n (33) 

ij 

with probabihty greater than 1 — (^n~^^^ j . The proof is completed by ( l33l) 

and Theorem [6l | 

Proof of Theorems [2] and O. Since S~|, is a feasible point, we have by flTU)) . 



llf^plli < llf^iplli < \\i:-%<p'max{^^,p-). 

By ( 1281) . Theorem El the fact p > and since r is large enough, we have 

sup E\\Qp — floWl = sup E||r2p — r2o||2-^{max \aij — a^A + p < Co^/logp/n} 

+ sup E||f2p - f2o||2-^{max \aij - o-° | + p > CQ\J\ogp/n] 

= o[M'-"sl(p)(^-^y'')+o(p'max(:^,p^'')p-'^) 

This proves Theorem |21 The proof of Theorem [S] is similar. | 

Proof of Theorem [3]. Let A;„ be an integer satisfying 1 < A;„, < -n.. Define 
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h] = {uijl{l < i < l<i<pf - u:l h] = hj - h). 

By the proof of Theorem El we can show that \hj\i < 2|/ij|i. Since fio &Uo{a, M), 
we have Yl,j>k,, \^%\ ^ ^K""- % Theorem HI Yl^jl^ I'^ij ~ = o(^kny^logp/nj 
with probabihty greater than 1 — (^n~^^^ + j . Theorem [3] (i) is proved by 
taking fc„ = [(n/ logp)^/'-^""''^-*]. The proof of Theorem [3] (ii) is similar as that of 
Theorem |2l | 
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